A DESCRIPTION OF A COMPUTER-USABLE DICTIONARY FILE BASED ON
THE OXFORD ADVANCED LEARNER'S DICTIONARY OF CURRENT ENGLISH
Roger Mitton,
Department of Computer Science,
Birkbeck College,
University of London,
Malet Street,
London WC1E 7HX
June 1992 (supersedes the versions of March and Nov 1986)
In 1985-86 I produced a dictionary file called CUVOALD (Computer
Usable Version of the Oxford Advanced Learner's Dictionary). This was
a partial dictionary of English in computer-usable form - "partial"
because each entry contained only some of the information from the
original dictionary, and "computer-usable" (rather than merely
"computer-readable") because it was in a form that made it easy for
programs to access it. A second file, called CUV2, was produced at
the same time. This was derived from CUVOALD and was the same except
that it also contained all inflected forms explicitly, eg it contained
"added", "adding" and "adds" as well as "add". I have now added some
information to each entry and some more entries to CUV2, to produce a
new version of CUV2. This document describes this new file.
These files were derived originally from the Oxford Advanced
Learner's Dictionary of Current English [1], third edition, published
by the Oxford University Press, 1974, the machine-readable version of
which is available to researchers from the Oxford Text Archive. The
task of deriving them from the machine-readable OALDCE was carried out
as part of a research project, funded by the Leverhulme Trust, into
spelling correction. The more recent additions have been carried out
as part of my research as a lecturer in Computer Science at Birkbeck
College.
THE FILE FORMAT
CUV2 contains 70646 entries. Each entry occupies one line.
Samples are given at the end of this document. The longest spelling
is 23 characters; the longest pronunciation is also 23; the longest
syntactic-tag field is also (coincidentally) 23; the number of
syllables is just one character ('1' to '9'), and the longest
verb-pattern field is 58. The fields are padded with spaces to the
lengths of the longest, ie 23, 23, 23, 1 and 58, making the record
length 128. The spelling begins at position 1, the pronunciation at
position 24, the syntactic-tag field at position 47, the number of
syllables is character 70, and the verb-pattern field begins at
position 71. The file is sorted in ASCII sequence; this means, of
course, that the entries are not in the same order as in the OALDCE.
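
The fixed-width layout described above can be read with simple string
slicing. The following is a minimal sketch in Python, not part of the
original distribution; the names WIDTHS, Entry and parse_record are
illustrative assumptions, and the padding step assumes that trailing
spaces may have been trimmed in transit.

```python
from typing import NamedTuple

# Field widths as given in the text: spelling, pronunciation,
# syntactic tags, syllable count, verb patterns.
WIDTHS = (23, 23, 23, 1, 58)
RECORD_LENGTH = sum(WIDTHS)  # 128, matching the stated record length


class Entry(NamedTuple):
    spelling: str
    pronunciation: str
    tags: str
    syllables: str
    verb_patterns: str


def parse_record(line: str) -> Entry:
    """Split one fixed-width record into its five fields."""
    # Remove only the line terminator, then restore any trailing padding
    # that an editor or transfer program may have trimmed (an assumption).
    line = line.rstrip("\r\n").ljust(RECORD_LENGTH)
    fields = []
    start = 0
    for width in WIDTHS:
        # Trailing spaces are padding; internal spaces (as in
        # "above board") are part of the field and are kept.
        fields.append(line[start:start + width].rstrip())
        start += width
    return Entry(*fields)
```

The syllable count is kept as a string here so that a damaged record
does not raise an error during parsing; converting it with int() is
left to the caller.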
WHAT THE DICTIONARY CONTAINS
Each entry consists of a spelling, a pronunciation, one or more
syntactic tags (parts-of-speech) with rarity flags, a syllable count,
and a set of verb patterns for verbs.
The first file derived from the OALDCE (CUVOALD) contained all
the headwords and subentries from the original dictionary - subentries
are words like "abandonment" which comes under the headword "abandon"
- except for a handful that contained funny characters (such as "Lsd"
where the "L" was a pound sign). Subentries were not included if they
consisted of two or three separate words that occurred individually
elsewhere in the dictionary, such as "division bell" which comes under
the headword "division", except when the combination formed a
syntactic unit not immediately predictable from its constituents, eg
"above board", which is listed as an adverb. To this list of about
35,000 entries, I added about 2,500 proper names - common forenames,
British towns with a population of over 5,000, countries,
nationalities, states, counties and major cities of the world. I
would like to have added many more proper names, but I didn't have the
time.
The second version of the file (CUV2) contained all these entries
plus inflected forms making a total of about 68,000 entries. Since
1986 I have made a number of corrections, added the rarity flags and
the syllable counts and inserted about 2,000 new entries. The new
entries, nearly all of which were derived forms of words already in
the dictionary, were selected from a list of several thousand words
that occurred in the LOB Corpus [3] but were not in CUV2. I also made
changes to existing entries where these were implied by the new
entries; for example, when adding a plural form of a word whose
existing tag was "uncountable", it was necessary to change the tag of
the singular form. I also added about 300 reasonably common
abbreviations (see note below).
A number of words (ie spellings) have more than one entry in the
OALDCE, eg "water 1" (noun) and "water 2" (verb). In CUV2, each word
has only one entry unless it has two different pronunciations, eg
"abuse" (noun and verb). I have departed from this rule in the case
of compound adjectives, such as "hard-working", which have a slightly
different stress pattern depending on whether they are used
attributively ("she's a hard-working girl") or predicatively ("she's
very hard-working"). These are entered only once; they generally have
the attributive stress pattern except when the predicative one seemed
the more natural. (See also the note below on abbreviations.) I have
also given only one entry to those words that have strong and weak
forms of pronunciation, such as "am" (which can be pronounced &m, @m
or m). Generally it is the strong form that is entered.
As regards the coverage of the dictionary, readers might be
interested in a paper by Geoffrey Sampson [4] in which he analyses a
set of words from a sample of the LOB Corpus [3] that were not in CUV2.
The recent additions should have gone some way to plugging the gaps
that his study identified.
THE SPELLINGS
The spelling contains the characters "A" to "Z", "a" to "z",
hyphen, apostrophe, space, umlaut or diaeresis (HEX 22), cedilla (3C),
circumflex (5E), acute (5F), grave (60) and tilde (7E). These
diacritic characters precede the letter that they mark, eg "se~nor".
(There are also the characters "5" and "6" in "MI5" and "MI6".)
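
For display purposes, the diacritic markers can be turned into accented
letters. The sketch below is one possible approach in Python, not part
of the original file; the marker codes are those listed above, but the
pairing of each marker with a Unicode combining character, and the name
decode_spelling, are assumptions made for illustration.

```python
import unicodedata

# ASCII marker -> Unicode combining character. The marker codes come from
# the text; the choice of combining characters is an assumption.
COMBINING = {
    '"': "\u0308",  # HEX 22: umlaut or diaeresis
    "<": "\u0327",  # HEX 3C: cedilla
    "^": "\u0302",  # HEX 5E: circumflex
    "_": "\u0301",  # HEX 5F: acute
    "`": "\u0300",  # HEX 60: grave
    "~": "\u0303",  # HEX 7E: tilde
}


def decode_spelling(spelling: str) -> str:
    """Replace marker+letter pairs with a single accented letter."""
    out = []
    i = 0
    while i < len(spelling):
        ch = spelling[i]
        if ch in COMBINING and i + 1 < len(spelling):
            # The marker precedes its letter in the file, whereas a
            # Unicode combining character follows the letter it marks.
            out.append(spelling[i + 1] + COMBINING[ch])
            i += 2
        else:
            out.append(ch)
            i += 1
    # NFC composes letter + combining mark into one code point where possible.
    return unicodedata.normalize("NFC", "".join(out))
```

With this mapping, decode_spelling("se~nor") gives "señor". Spellings
without markers, such as "MI5", pass through unchanged.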
THE PRONUNCIATIONS
The pronunciation uses a set of characters very like the one
adopted by the Alvey Speech Club for representing IPA in ASCII [2].
The system is as follows:
    i            as in  bead          N    as in  sing
    I                   bid           T           thin
    e                   bed           D           then
    & (ampsnd)          bad           S           shed
    A                   bard          Z           beige
    0 (zero)            cod           tS          etch
    O (cap O)           cord          dZ          edge
    U                   good
    u                   food          p  t  k  b  d  g
    V                   bud           m  n  f  v  s  z
    3 (three)           bird          r  l  w  h  j
    @                   "a" in about

    eI           as in  day
    @U                  go
    aI                  eye
    aU                  cow
    oI                  boy
    I@                  beer
    e@                  bare
    U@                  tour

    R-linking (the sounding of a /r/ at the end of a word when it is
    followed by a vowel) is marked R, eg fAR for "far" (compare "far
    away" with "far beyond").

    Primary stress:    apostrophe   eg  @'baUt  ("about")
    Secondary stress:  comma        eg  ,&ntI'septIk
    Plus-sign:         as in "courtship" ('kOt+SIp) and "bookclub" ('bUk+klVb)

    When the spelling contains a space and/or a hyphen, the
    pronunciation has one also, eg

        above board   @,bVv 'bOd        air-raid   'e@-reId
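
Because some symbols in this notation are two characters long (the
diphthongs and "tS", "dZ"), a program usually needs to split a
pronunciation into symbols before working with it. The following Python
sketch shows one way to do this; it is not part of the original file.
The function name tokenize is illustrative, and the greedy
left-to-right matching of two-character symbols is an assumption that
may not resolve every ambiguous sequence as intended.

```python
from typing import List

# Two-character symbols from the table above: diphthongs and affricates.
DIGRAPHS = {"eI", "@U", "aI", "aU", "oI", "I@", "e@", "U@", "tS", "dZ"}


def tokenize(pron: str) -> List[str]:
    """Split a pronunciation string into symbols, left to right."""
    tokens = []
    i = 0
    while i < len(pron):
        pair = pron[i:i + 2]
        if pair in DIGRAPHS:
            # Prefer a two-character symbol when one starts here.
            tokens.append(pair)
            i += 2
        else:
            # Single phoneme, stress mark, plus-sign, space, hyphen or R.
            tokens.append(pron[i])
            i += 1
    return tokens
```

For example, tokenize("@'baUt") returns ['@', "'", 'b', 'aU', 't'].
Stress marks, plus-signs, spaces and hyphens are kept as separate
tokens, so a "t" and an "S" on either side of a plus-sign are not
merged into "tS".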
THE SYNTACTIC TAGS
Every entry in the dictionary has at least one syntactic tag
(part-of-speech code). If an entry has more than one (eg "report"
noun and verb), they are in ASCII order and separated by commas. A
code consists of three characters, the first two being the syntactic
tag and the third a frequency class. The first character is one of the
capital letters "G" to "Z" (inclusive), which have the following meanings:
G Anomalous verb
H Transitive verb
I Intransitive verb
J Both transitive and intransitive verb
K Countable n